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Abstract 


We develop a general variational inference method that preserves dependency 
among the latent variables. Our method uses copulas to augment the families of 
distributions used in mean-field and structured approximations. Copulas model the 
dependency that is not captured by the original variational distribution, and thus 
the augmented variational family guarantees better approximations to the posterior. 

With stochastic optimization, inference on the augmented distribution is scalable. 
Furthermore, our strategy is generic: it can be applied to any inference procedure 
that currently uses the mean-field or structured approach. Copula variational in¬ 
ference has many advantages: it reduces bias; it is less sensitive to local optima; 
it is less sensitive to hyperparameters; and it helps characterize and interpret the 
dependency among the latent variables. 

1 Introduction 

Variational inference is a computationally efficient approach for approximating posterior distribu¬ 
tions. The idea is to specify a tractable family of distributions of the latent variables and then to min¬ 
imize the Kullback-Leibler divergence from it to the posterior. Combined with stochastic optimiza¬ 
tion, variational inference can scale complex statistical models to massive data sets [9, 23, 24]. 

Both the computational complexity and accuracy of variational inference are controlled by the fac¬ 
torization of the variational family. To keep optimization tractable, most algorithms use the fully- 
factorized family, also known as the mean-field family, where each latent variable is assumed inde¬ 
pendent. Less common, structured mean-field methods slightly relax this assumption by preserving 
some of the original structure among the latent variables [19]. Factorized distributions enable effi¬ 
cient variational inference but they sacrifice accuracy. In the exact posterior, many latent variables 
are dependent and mean-field methods, by construction, fail to capture this dependency. 

To this end, we develop copula variational inference (copula vi). Copula vi augments the traditional 
variational distribution with a copula, which is a flexible construction for learning dependencies in 
factorized distributions [3]. This strategy has many advantages over traditional vi: it reduces bias; 
it is less sensitive to local optima; it is less sensitive to hyperparameters; and it helps characterize 
and interpret the dependency among the latent variables. Variational inference has previously been 
restricted to either generic inference on simple models — where dependency does not make a signif¬ 
icant difference—or writing model-specific variational updates. Copula vi widens its applicability, 
providing generic inference that finds meaningful dependencies between latent variables. 

In more detail, our contributions are the following. 

A generalization of the original procedure in variational inference. Copula vi generalizes vari¬ 
ational inference for mean-field and structured factorizations: traditional vi corresponds to running 
only one step of our method. It uses coordinate descent, which monotonically decreases the KL 
divergence to the posterior by alternating between fitting the mean-field parameters and the copula 
parameters. Figure 1 illustrates copula vi on a toy example of fitting a bivariate Gaussian. 

Improving generic inference. Copula vi can be applied to any inference procedure that currently 
uses the mean-field or structured approach. Further, because it does not require specific knowledge 
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Figure 1: Approximations to an elliptical Gaussian. The mean-field (red) is restricted to fitting 
independent one-dimensional Gaussians, which is the first step in our algorithm. The second step 
(blue) fits a copula which models the dependency. More iterations alternate; the third refits the mean- 
field (green) and the fourth refits the copula (cyan), demonstrating convergence to the true posterior. 


of the model, it falls into the framework of black box variational inference [15]. An investigator 
need only write down a function to evaluate the model log-likelihood. The rest of the algorithm’s 
calculations, such as sampling and evaluating gradients, can be placed in a library. 

Richer variational approximations. In experiments, we demonstrate copula vi on the standard 
example of Gaussian mixture models. We found it consistently estimates the parameters, reduces 
sensitivity to local optima, and reduces sensitivity to hyperparameters. We also examine how well 
COPULA VI captures dependencies on the latent space model [7]. Copula vi outperforms competing 
methods and significantly improves upon the mean-field approximation. 


2 Background 

2.1 Variational inference 

Let X be a set of observations, z be latent variables, and A be the free parameters of a variational 
distribution g(z; A). We aim to find the best approximation of the posterior p(z | x) using the vari¬ 
ational distribution q{z; A), where the quality of the approximation is measured by KL divergence. 
This is equivalent to maximizing the quantity 

C{\) = E,(^.A)[logp(x,z)] - Eg(^.A)[logg(z; A)]. 

£(A) is the evidence lower bound (elbo), or the variational free energy [25]. For simpler computa¬ 
tion, a standard choice of the variational family is a mean-field approximation 

d 

9(z; A) = Wq^{zp,Xi), 

i=l 

where z = (zi,..., z^). Note this is a strong independence assumption. More sophisticated ap¬ 
proaches, known as structured variational inference [19], attempt to restore some of the dependencies 
among the latent variables. 

In this work, we restore dependencies using copulas. Structured vi is typically tailored to individual 
models and is difficult to work with mathematically. Copulas learn general posterior dependencies 
during inference, and they do not require the investigator to know such structure in advance. Further, 
copulas can augment a structured factorization in order to introduce dependencies that were not 
considered before; thus it generalizes the procedure. We next review copulas. 


2.2 Copulas 


We will augment the mean-field distribution with a copula. We consider the variational family 


9(z) 


d 


c(Q(zi),...,Q(zd)). 
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Figure 2: Example of a vine V which factorizes a copula density of four random variables 
c(ui, U 2 , U 3 , U 4 ) into a product of 6 pair copulas. Edges in the tree Tj are the nodes of the lower level 
tree Tj+i, and each edge determines a bivariate copula which is conditioned on all random variables 
that its two connected nodes share. 


Here Q{'Zi) is the marginal cumulative distribution function (CDE) of the random variable z^, and 
c is a joint distribution of [0,1] random variables.* The distribution c is called a copula of z; it 
is a joint multivariate density of (5(zi),..., Q(Z(i) with uniform marginal distributions [21]. Eor 
any distribution, a factorization into a product of marginal densities and a copula always exists and 
integrates to one [14]. 

Intuitively, the copula captures the information about the multivariate random variable after elimi¬ 
nating the marginal information, i.e., by applying the probability integral transform on each variable. 
The copula captures only and all of the dependencies among the z^’s. Recall that, for all random vari¬ 
ables, Q{'Zi) is uniform distributed. Thus the marginals of the copula give no information. 

Eor example, the bivariate Gaussian copula is defined as 

c(ui,U2;p) = $p($"*(ui),4>"*(u2)). 

If Ui, U 2 are independent uniform distributed, the inverse CDE 4)“* of the standard normal trans¬ 
forms (ui, U 2 ) to independent normals. The CDE <i)p of the bivariate Gaussian distribution, with 
mean zero and Pearson correlation p, squashes the transformed values back to the unit square. Thus 
the Gaussian copula directly correlates Ui and U 2 with the Pearson correlation parameter p. 

2.2.1 Vine copulas 

It is difficult to specify a copula. We must find a family of distributions that is easy to compute with 
and able to express a broad range of dependencies. Much work focuses on two-dimensional copulas, 
such as the Student-f, Clayton, Gumbel, Prank, and Joe copulas [14]. However, their multivariate ex¬ 
tensions do not flexibly model dependencies in higher dimensions [4]. Rather, a successful approach 
in recent literature has been by combining sets of conditional bivariate copulas; the resulting joint is 
called a vine [ 10 , 13]. 

A vine V factorizes a copula density c(ui,..., Ucj) into a product of conditional bivariate copulas, 
also called pair copulas. This makes it easy to specify a high-dimensional copula. One need only ex¬ 
press the dependence for each pair of random variables conditioned on a subset of the others. 

Pigure 2 is an example of a vine which factorizes a 4-dimensional copula into the product of 6 pair 
copulas. The first tree Ti has nodes 1,2, 3,4 representing the random variables Ui, U 2 , U 3 , U 4 respec¬ 
tively. An edge corresponds to a pair copula, e.g., 1,4 symbolizes c(ui, U 4 ). Edges in Ti collapse 
into nodes in the next tree T 2 , and edges in T 2 correspond to conditional bivariate copulas, e.g., 
1, 2|3 symbolizes c(ui, U 2 IU 3 ). This proceeds to the last nested tree T 3 , where 2,4|13 symbolizes 

’We overload the notation for the marginal CDF Q to depend on the names of the argument, though we oc¬ 
casionally use Qi{zi) when more clarity is needed. This is analogous to the standard convention of overloading 
the probability density function q{-). 
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c(u2, U4|ui, U3). The vine structure specifies a complete factorization of the multivariate copula, 
and each pair copula can be of a different family with its own set of parameters: 


C(U1,U2,U3,U4) 


C(ui, U3)C(U2, U3)C(U3, U4) 


C(U1,U2|U3)C(U1,U4|U3) 


C(U2,U4|U1,U3) 


Formally, a vine is a nested set of trees V = {Ti,..., Td-i} with the following properties: 

1. Tree Tj = {Nj,Ej} has d-\-l — j nodes and d — j edges. 

2. Edges in the tree Ej are the nodes in the {j + 1)*^ tree Nj+i. 

3. Two nodes in tree T}_|_i are joined by an edge only if the corresponding edges in tree Tj 
share a node. 


Each edge e in the nested set of trees {Ti,..., Td- 1 } specifies a different pair copula, and the product 
of all edges comprise of a factorization of the copula density. Since there are a total of d{d — l)/2 
edges, V factorizes c(ui,..., u^;) as the product of d{d — l)/2 pair copulas. 

Each edge e(i, k) G Tj has a conditioning set D{e), which is a set of variable indices 1,..., d. We 
define Cik\D{e) to be the bivariate copula density for and given its conditioning set: 


Cik\D{e) = c(^Q(u,|uj : j G D{e)),Q{ui\uj : j G D{e)) : j G D{e)^. 


( 1 ) 


Both the copula and the CDE’s in its arguments are conditional on D{e). A vine specifies a factor¬ 
ization of the copula, which is a product over all edges in the d — 1 levels: 


d-l 

c(ui,... ,Ud;r/) = c,k\D{e)- ( 2 ) 

j = l e{i,k)eEj 


We highlight that c depends on r), the set of all parameters to the pair copulas. The vine construction 
provides us with the flexibility to model dependencies in high dimensions using a decomposition of 
pair copulas which are easier to estimate. As we shall see, the construction also leads to efficient 
stochastic gradients by taking individual (and thus easy) gradients on each pair copula. 


3 Copula variational inference 

We now introduce copula variational inference ( copula vij, our method for performing accurate and 
scalable variational inference. Eor simplicity, consider the mean-field factorization augmented with 
a copula (we later extend to structured factorizations). The copula-augmented variational family is 


c(Q(zi;A),...,Q(zd;A);r 7 ), (3) 

'-V-' 

copula 

mean-field 

where A denotes the mean-field parameters and rj the copula parameters. With this family, we max¬ 
imize the augmented elbo, 

C (A, 77 ) = E,( 2 ;a, 77 ) [logp(x, z)] - [log ^(z; A, r/)]. 

Copula vi alternates between two steps: 1 ) fix the copula parameters rj and solve for the mean-field 
parameters A; and 2) fix the mean-field parameters A and solve for the copula parameters rj. This 
generalizes the mean-field approximation, which is the special case of initializing the copula to be 
uniform and stopping after the first step. We apply stochastic approximations [18] for each step with 
gradients derived in the next section. We set the learning rate pt G K to satisfy a Robbins-Monro 
schedule, i.e., Pt = 00, P? < 00. A summary is outlined in Algorithm 1 . 

This alternating set of optimizations falls in the class of minorize-maximization methods, which in¬ 
cludes many procedures such as the EM algorithm [1], the alternating least squares algorithm, and the 
iterative procedure for the generalized method of moments. Each step of copula vi monotonically 
increases the objective function and therefore better approximates the posterior distribution. 


g(z; A,r7) = 
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Algorithm 1: Copula variational inference (copula vi) 


Input: Data x, Model p(x, z), Variational family q. 
Initialize A randomly, t] so that c is uniform, 
while change in elbo is above some threshold do 
// Fix rj, maximize over A. 

Set iteration counter t = \. 
while not converged do 

Draw sample u ^ Unif([0, 

Update A = A + ptVxC. (Eq.5, Eq.6) 
Increment t. 

end 

// Eix A, maximize over rj. 

Set iteration counter t = 1. 
while not converged do 

Draw sample u ^ Unif([0, 

Update r) = r) + pt\7rid^- (Eq.7) 

Increment t. 


end 


end 

Output: Variational parameters (A,rj). 


Copula vi has the same generic input requirements as black-box variational inference [15] —the user 
need only specify the joint model p(x, z) in order to perform inference. Eurther, copula variational in¬ 
ference easily extends to the case when the original variational family uses a structured factorization. 
By the vine construction, one simply fixes pair copulas corresponding to pre-existent dependence in 
the factorization to be the independence copula. This enables the copula to only model dependence 
where it does not already exist. 

Throughout the optimization, we will assume that the tree structure and copula families are given 
and fixed. We note, however, that these can be learned. In our study, we learn the tree structure using 
sequential tree selection [ 2 ] and learn the families, among a choice of 16 bivariate families, through 
Bayesian model selection [ 6 ] (see supplement). In preliminary studies, we’ve found that re-selection 
of the tree structure and copula families do not significantly change in future iterations. 

3.1 Stochastic gradients of the elbo 

To perform stochastic optimization, we require stochastic gradients of the elbo with respect to both 
the mean-field and copula parameters. The copula vi objective leads to efficient stochastic gradients 
and with low variance. 

We first derive the gradient with respect to the mean-field parameters. In general, we can apply the 
score function estimator [15], which leads to the gradient 

Va>C = E,( 2 ;.A,^)[VAlog 9 (z; A,r 7 ) • (logp(x,z) - log g(z; A, 17 ))]. (4) 

We follow noisy unbiased estimates of this gradient by sampling from q{-) and evaluating the inner 
expression. We apply this gradient for discrete latent variables. 

When the latent variables z are differentiable, we use the reparameterization trick [17] to take ad¬ 
vantage of first-order information from the model, i.e.,Vz logp(x, z). Specifically, we rewrite the 
expectation in terms of a random variable u such that its distribution s(u) does not depend on the 
variational parameters and such that the latent variables are a deterministic function of u and the 
mean-field parameters, z = z(u; A). Eollowing this reparameterization, the gradients propagate 
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inside the expectation, 

Va£ = E^(u)[(Vzlogp(x,z) - Vzlogg(z; A,?7))Vaz(u; A)]. (5) 

This estimator reduces the variance of the stochastic gradients [17]. Furthermore, with a copula vari¬ 
ational family, this type of reparameterization using a uniform random variable u and a deterministic 
function z = z(u; A, rj) is always possible. (See the supplement.) 

The reparameterized gradient (Eq.5) requires calculation of the terms Vz-log g(z; A, 77 ) and 
VAiZ(u; A, rj) for each i. The latter is tractable and derived in the supplement; the former decom¬ 
poses as 

Vz, logg(z; A,J 7 ) = Vz^ logg(zi; A^) -f Vq(z,;ao log c(Q(zi; Ai), ..., (3(zd; A^); ? 7 )VziQ(zi; A^) 

d-l 

= Vzi \ogq{zi;Xi) +q{Zi;Xi) E E VQ(z,;Ai) l0gCfc£|£)(e). (6) 

i=l e(k,l)eEj: 
ie{k,e} 

The summation in Eq .6 is over all pair copulas which contain Q{zi; A^) as an argument. In other 
words, the gradient of a latent variable Zi is evaluated over both the marginal q{zi) and all pair 
copulas which model correlation between z^ and any other latent variable zj. A similar derivation 
holds for calculating terms in the score function estimator. 

We now turn to the gradient with respect to the copula parameters. We consider copulas which are 
differentiable with respect to their parameters. This enables an efficient reparameterized gradient 

Vr,C = Es(u) [(Vz logp(x, z) - Vz log 9 (z; A, 77 ))V^z(u; A, rj)]. (7) 

The requirements are the same as for the mean-field parameters. 

Einally, we note that the only requirement on the model is the gradient Vz logp(x, z). This can 
be calculated using automatic differentiation tools [22]. Thus Copula vi can be implemented in a 
library and applied without requiring any manual derivations from the user. 

3.2 Computational complexity 

In the vine factorization of the copula, there are d{d — l)/2 pair copulas, where d is the number of 
latent variables. Thus stochastic gradients of the mean-field parameters A and copula parameters rj 
require 0{d^) complexity. More generally, one can apply a low rank approximation to the copula by 
truncating the number of levels in the vine (see Eigure 2). This reduces the number of pair copulas 
to he Kd for some K > 0, and leads to a computational complexity of 0{Kd). 

Using sequential tree selection for learning the vine structure [2], the most correlated variables are at 
the highest level of the vines. Thus a truncated low rank copula only forgets the weakest correlations. 
This generalizes low rank Gaussian approximations, which also have 0{Kd) complexity [20]: it is 
the special case when the mean-field distribution is the product of independent Gaussians, and each 
pair copula is a Gaussian copula. 

3.3 Related work 

Preserving structure in variational inference was first studied by Saul and Jordan [19] in the case of 
probabilistic neural networks. It has been revisited recently for the case of conditionally conjugate 
exponential familes [ 8 ]. Our work differs from this line in that we learn the dependency structure 
during inference, and thus we do not require explicit knowledge of the model. Eurther, our augmen¬ 
tation strategy works more broadly to any posterior distribution and any factorized variational family, 
and thus it generalizes these approaches. 

A similar augmentation strategy is higher-order mean-field methods, which are a Taylor series correc¬ 
tion based on the difference between the posterior and its mean-field approximation [11]. Recently, 
Giordano et al. [5] consider a covariance correction from the mean-field estimates. All these methods 
assume the mean-field approximation is reliable for the Taylor series expansion to make sense, which 
is not true in general and thus is not robust in a black box framework. Our approach alternates the 
estimation of the mean-field and copula, which we find empirically leads to more robust estimates 
than estimating them simultaneously, and which is less sensitive to the quality of the mean-field 
approximation. 
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Figure 3: Covariance estimates from copula variational inference (copula vi), mean-field (mf), and 
linear response variational Bayes (lrvb) to the ground truth (Gibbs samples), copula vi and lrvb 
effectively capture dependence while mf underestimates variance and forgets covariances. 


4 Experiments 

We study copula vi with two models: Gaussian mixtures and the latent space model [7]. The Gaus¬ 
sian mixture is a classical example of a model for which it is difficult to capture posterior dependen¬ 
cies. The latent space model is a modern Bayesian model for which the mean-held approximation 
gives poor estimates of the posterior, and where modeling posterior dependencies is crucial for un¬ 
covering patterns in the data. 

There are several implementation details of copula vi. At each iteration, we form a stochastic gra¬ 
dient by generating m samples from the variational distribution and taking the average gradient. We 
set m = 1024 and follow asynchronous updates [16]. We set the step-size using ADAM [12]. 

4.1 Mixture of Gaussians 

We follow the goal of Giordano et al. [5], which is to estimate the posterior covariance for a Gaussian 
mixture. The hidden variables are a AT-vector of mixture proportions tt and a set of K P-dimensional 
multivariate normals A[r^), each with unknown mean (a P-vector) and P x P precision 

matrix A^. In a mixture of Gaussians, the joint probability is 

K N 

p(x,z,/x, A-^tt) =p( 7 r) A^:^) p(x„ | z„, A;;^)p(z„ | tt), 

k—1 n—1 

with a Dirichlet prior p{n) and a normal-Wishart prior , A^^). 

We hrst apply the mean-held approximation (mf), which assigns independent factors to /x, tt, A, and 
z. We then perform copula vi over the copula-augmented mean-held distribution, i.e., one which 
includes pair copulas over the latent variables. We also compare our results to linear response varia¬ 
tional Bayes (lrvb) [5], which is a posthoc correction technique for covariance estimation in varia¬ 
tional inference. Higher-order mean-held methods demonstrate similar behavior as lrvb. Compar¬ 
isons to structured approximations are omitted as they require explicit factorizations and are not black 
box. Standard black box variational inference [15] corresponds to the mf approximation. 

We simulate 10, 000 samples with K = 2 components and P = 2 dimensional Gaussians. Figure 
3 displays estimates for the standard deviations of A for 100 simulations, and plots them against the 
ground truth using 500 effective Gibb samples. The second plot displays all off-diagonal covariance 
estimates. Estimates for /x and tt indicate the same pattern and are given in the supplement. 

When initializing at the true mean-held parameters, both copula vi and lrvb achieve consistent 
estimates of the posterior variance, mf underestimates the variance, which is a well-known limita¬ 
tion [25]. Note that because the mf estimates are initialized at the truth, copula vi converges to the 
true posterior upon one step of htting the copula. It does not require alternating more steps. 
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Variational inference methods 

Predictive Likelihood 

Runtime 

Mean-field 

-383.2 

15 min. 

LRVB 

-330.5 

38 min. 

COPULA VI (2 steps) 

-303.2 

32 min. 

COPULA VI (5 steps) 

-80.2 

1 hr. 17 min. 

COPULA VI (converged) 

-50.5 

2 hr. 


Table 1: Predictive likelihood on the latent space model. Each copula vi step either refits the mean- 
field or the copula, copula vi converges in roughly 10 steps and already significantly outperforms 
both mean-field and lrvb upon fitting the copula once (2 steps). 


Copula vi is more robust than lrvb. As a toy demonstration, we analyze the MNIST data set of 
handwritten digits, using 12,665 training examples and 2,115 test examples of O’s and I’s. We per¬ 
form "unsupervised" classification, i.e., classify without using training labels: we apply a mixture of 
Gaussians to cluster, and then classify a digit based on its membership assignment, copula vi reports 
a test set error rate of 0.06, whereas lrvb ranges between 0.06 and 0.32 depending on the mean-field 
estimates, lrvb and similar higher order mean-field methods correct an existing mf solution—it is 
thus sensitive to local optima and the general quality of that solution. On the other hand, copula vi 
re-adjusts both the mf and copula parameters as it fits, making it more robust to initialization. 

4.2 Latent space model 

We next study inference on the latent space model [7], a Bernoulli latent factor model for network 
analysis. Each node in an A^-node network is associated with a P-dimensional latent variable z ^ 
N(fi, A^^). Edges between pairs of nodes are observed with high probability if the nodes are close 
to each other in the latent space. Eormally, an edge for each pair {i,j) is observed with probability 
logit(p) = 9 — \zi — Zj\, where 6* is a model parameter. 

We generate an W = 100,000 node network with latent node attributes from a P = 10 dimensional 
Gaussian. We learn the posterior of the latent attributes in order to predict the likelihood of held-out 
edges. MF applies independent factors on /r, A, 9 and z, lrvb applies a correction, and copula vi 
uses the fully dependent variational distribution. Table 1 displays the likelihood of held-out edges and 
runtime. We also attempted Hamiltonian Monte Carlo but it did not converge after five hours. 

Copula vi dominates other methods in accuracy upon convergence, and the copula estimation with¬ 
out refitting (2 steps) already dominates lrvb in both runtime and accuracy. We note however that 
LRVB requires one to invert a 0{NK^) x 0{NK^) matrix. We can better scale the method and 
achieve faster estimates than copula vi if we applied stochastic approximations for the inversion. 
However, copula vi always outperforms lrvb and is still fast on this 100,000 node network. 


5 Conclusion 


We developed copula variational inference (copula vi). copula vi is a new variational inference 
algorithm that augments the mean-field variational distribution with a copula; it captures posterior 
dependencies among the latent variables. We derived a scalable and generic algorithm for performing 
inference with this expressive variational distribution. We found that copula vi significantly reduces 
the bias of the mean-field approximation, better estimates the posterior variance, and is more accurate 
than other forms of capturing posterior dependency in variational approximations. 
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